E-commerce Agentic RAG & Evaluation

This notebook covers the end-to-end deployment and governance of a shopping assistant.

Core Components:

  • RAG Setup: ChromaDB with products.json catalog.
  • Multi-Tool Agent: Orchestration via product_search and product_comparison.
  • LLM-as-a-Judge: Evaluates retrieval relevance and answer correctness.

Agentic RAG for E-commerce: Orchestration & Evaluation

This notebook demonstrates a production-grade Agentic RAG pipeline, featuring:

  • Dynamic Retrieval: Tools that query a vector database on-demand.
  • Orchestration: An agent that reasons through multi-step customer queries.
  • Multi-hop Reasoning: Connecting disparate product specs to form complex answers.
  • Evaluations: Using Arize Phoenix to log traces and run LLM-as-a-Judge relevancy checks.
In [ ]:
ARIZE_API_KEY=""
ARIZE_SPACE_ID=""
OPENAI_API_KEY=""
In [ ]:
# --- 1. SETUP & TRACING ---
# !pip install -qqq arize-otel arize agno openai openinference-instrumentation-agno openinference-instrumentation-openai chromadb sentence-transformers arize-phoenix

import os
import json
from getpass import getpass
from datetime import datetime
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.agno import AgnoInstrumentor


# Read each key from the cell above, or prompt for it if it was left empty
os.environ["ARIZE_SPACE_ID"] = globals().get("ARIZE_SPACE_ID") or getpass("🔑 Enter your Arize Space ID: ")

os.environ["ARIZE_API_KEY"] = globals().get("ARIZE_API_KEY") or getpass("🔑 Enter your Arize API Key: ")

os.environ["OPENAI_API_KEY"] = globals().get("OPENAI_API_KEY") or getpass("🔑 Enter your OpenAI API Key: ")

model_id = "ecom-agent-eval-v4"
tracer_provider = register(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY"),
    project_name=model_id,
    set_global_tracer_provider=True
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
AgnoInstrumentor().instrument(tracer_provider=tracer_provider)
🔭 OpenTelemetry Tracing Details 🔭
|  Arize Project: ecom-agent-eval-v4
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: otlp.arize.com
|  Transport: gRPC
|  Transport Headers: {'authorization': '****', 'api_key': '****', 'arize-space-id': '****', 'space_id': '****', 'arize-interface': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.

Next, we give the product_search tool access to the product catalog. We use ChromaDB as the vector database and a Sentence Transformer model to generate embeddings, so the tool can retrieve the products most relevant to a query.
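To make the retrieval idea concrete, here is a minimal sketch of similarity ranking over embeddings. It assumes cosine similarity as the metric (the metric ChromaDB actually uses depends on how the collection is configured), and the three-dimensional vectors and the names `toy_vectors`, `toy_query`, and `rank_by_similarity` are hypothetical stand-ins for real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional embeddings; all-MiniLM-L6-v2 produces 384 dimensions.
toy_vectors = {
    "shelf": [0.9, 0.1, 0.0],
    "organizer": [0.8, 0.3, 0.1],
    "bin": [0.1, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(toy_query, toy_vectors, top_k=2)
print([doc_id for doc_id, _ in ranking])
```

The vector database performs the same kind of nearest-neighbour lookup, only over the full catalog and with an index that avoids comparing the query against every document.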

In [ ]:
! pip install "numpy<2" torch "transformers<5" sentence-transformers chromadb

2. Knowledge Base & Vector DB

We initialize ChromaDB using the SentenceTransformerEmbeddingFunction to avoid hardware acceleration errors.

In [4]:
import chromadb
import chromadb.utils.embedding_functions as ef
from sentence_transformers import SentenceTransformer

# 1. Mock Catalog Data
products = [
    {"product": "LiteProduct 1", "category": "Home Organization", "price": "$25", "rating": "4.5", "reviews": "120", "description": "Minimalist shelf.", "specs": "Plastic, 5kg limit"},
    {"product": "MaxProduct 2", "category": "Home Organization", "price": "$45", "rating": "4.8", "reviews": "85", "description": "Heavy duty organizer.", "specs": "Steel, 20kg limit"},
    {"product": "EcoBin", "category": "Waste Management", "price": "$15", "rating": "4.2", "reviews": "50", "description": "Recycled bin.", "specs": "10L"}
]

# 2. Initialize Chroma with explicit Embedding Function
emb_func = ef.SentenceTransformerEmbeddingFunction(model_name='all-MiniLM-L6-v2')
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
chroma_client = chromadb.Client()
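# get_or_create_collection lets this cell be re-run without a "collection already exists" error
collection = chroma_client.get_or_create_collection(name="ecom_catalog", embedding_function=emb_func)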

collection.add(
    documents=[f"{p['product']} in {p['category']}: {p['description']}" for p in products],
    metadatas=products,
    ids=[f"id_{i}" for i in range(len(products))]
)
2026-02-20 23:27:35.356058: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
In [30]:
import json
import chromadb
import chromadb.utils.embedding_functions as ef
from sentence_transformers import SentenceTransformer

# Model used by the tools to embed queries at search time
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# 1. Single initialization of the embedding function
emb_func = ef.SentenceTransformerEmbeddingFunction(model_name='all-MiniLM-L6-v2')
# 2. Setup Client
chroma_client = chromadb.Client() 
collection = chroma_client.get_or_create_collection(
    name="ecommerce_products",
    embedding_function=emb_func
)

def load_and_index_products():
    """Load and index e-commerce products into ChromaDB"""
    with open('/Users/madmax_jos/Documents/ecommerce_products.json', 'r') as f:
        products = json.load(f)
    
    documents = []
    metadatas = []
    ids = []
    
    for i, product in enumerate(products):
        # Create rich text representation for embedding
        text = f"Product: {product['product']}. Category: {product['category']}. Description: {product['description']}. Specs: {product['specs']}. Price: {product['price']}. Rating: {product['rating']}."
        
        documents.append(text)
        metadatas.append({
            "product": product["product"],
            "category": product["category"],
            "description": product["description"],
            "specs": product["specs"],
            "price": product["price"],
            "rating": product["rating"],
            "reviews": product["reviews"]
        })
        ids.append(f"product_{i}")
    
    # Add to ChromaDB collection
    collection.upsert(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )
    
    print(f"✅ Indexed {len(documents)} products in vector database")
    return len(documents)
In [31]:
# Load the data
num_products = load_and_index_products()
✅ Indexed 100 products in vector database
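The indexing function reads seven keys from every record, so a single incomplete record would raise a KeyError partway through the load. A minimal pre-check, sketched under the assumption that the JSON file is a list of flat dictionaries like the mock catalog above, could look like this. `REQUIRED_FIELDS`, `find_incomplete_records`, and the "Broken item" record are hypothetical and not part of the notebook.

```python
# Keys that load_and_index_products() reads from each record.
REQUIRED_FIELDS = ("product", "category", "description", "specs", "price", "rating", "reviews")

def find_incomplete_records(records):
    """Return (index, missing_fields) for every record lacking a required key."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append((index, missing))
    return problems

sample = [
    {"product": "EcoBin", "category": "Waste Management", "price": "$15", "rating": "4.2",
     "reviews": "50", "description": "Recycled bin.", "specs": "10L"},
    {"product": "Broken item", "category": "Unknown"},  # hypothetical incomplete record
]

print(find_incomplete_records(sample))
```

Running a check like this before `collection.upsert` makes it possible to skip or repair bad records instead of failing midway.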

3. Define Tools & Agent

This section implements the product_search and product_comparison tools and wires them into the agent.

In [35]:
from agno.tools import tool
from opentelemetry import trace
from openinference.semconv.trace import SpanAttributes
from agno.agent import Agent
from agno.models.openai import OpenAIChat

tracer = trace.get_tracer(__name__)


# The tools below reuse `embedding_model` and `collection` from the indexing cells above

# RAG Tool for product search
@tool
def product_search(query: str, category: str = None, price: str = None) -> str:
    """Search for products using semantic similarity from vector database.
    Args:
        query: The search keywords.
        category: The product category.
        price: The budget or price range (e.g., 'under $50').
    """
    with tracer.start_as_current_span(name="RAG", attributes={}) as span:
        # Category and price are appended to the query text; they bias the search but are not hard filters
        search_query = " ".join(part for part in (query, category, price) if part)
        span.set_attribute(SpanAttributes.INPUT_VALUE, search_query)

        # Embed the query with the same model used to index the catalog
        query_embedding = embedding_model.encode([search_query]).tolist()

        results = collection.query(
            query_embeddings=query_embedding,
            n_results=3
        )

        # Chroma returns one (possibly empty) result list per query
        if not results or not results.get("documents") or not results["documents"][0]:
            return "No matching products found. Try different keywords or categories."
        
        retrieved_docs = results["documents"][0]
        retrieved_meta = results["metadatas"][0]
        
        # Format product recommendations
        products_found = []
        for doc, meta in zip(retrieved_docs, retrieved_meta):
            product_info = f"""
**{meta['product']}**
- **Category**: {meta['category']}
- **Price**: {meta['price']}
- **Rating**: {meta['rating']} ({meta['reviews']})
- **Description**: {meta['description']}
- **Specs**: {meta['specs']}
            """
            products_found.append(product_info.strip())
        
        response = f"Found {len(products_found)} matching products:\n\n" + "\n\n".join(products_found)
        span.set_attribute(SpanAttributes.OUTPUT_VALUE, response)
        
        return response

@tool
def product_comparison(products: str) -> str:
    """Compare multiple products based on features and pricing"""
    product_list = [p.strip() for p in products.split(',')]
    comparison = "Product Comparison:\n\n"
    
    for product_name in product_list:
        results = collection.query(
            query_embeddings=embedding_model.encode([product_name]),
            n_results=1
        )
        
        if results["metadatas"] and results["metadatas"][0]:
            meta = results["metadatas"][0][0]
            comparison += f"**{meta['product']}**: {meta['price']} | {meta['rating']}★ | {meta['category']}\n"
    
    return comparison if len(product_list) > 1 else "Please specify multiple products to compare."
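Because product_search only appends the price to the query text, nothing in the tool guarantees that a returned product is within budget. One possible way to enforce the limit is to post-filter the retrieved metadata, sketched below. It assumes prices are stored as strings such as "$25" and treats the first number in the budget text as an inclusive cap. `parse_price`, `filter_by_budget`, and `catalog` are hypothetical names and are not part of the notebook.

```python
import re

def parse_price(text):
    """Extract the first number from strings such as '$25' or 'under $50'."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

def filter_by_budget(metadatas, budget_text):
    """Keep only records whose price is at or below the budget (assumed inclusive)."""
    limit = parse_price(budget_text)
    if limit is None:
        return list(metadatas)  # no usable budget, so nothing is filtered out
    kept = []
    for meta in metadatas:
        price = parse_price(meta.get("price", ""))
        if price is not None and price <= limit:
            kept.append(meta)
    return kept

# Prices taken from the mock catalog defined earlier in the notebook.
catalog = [
    {"product": "LiteProduct 1", "price": "$25"},
    {"product": "MaxProduct 2", "price": "$45"},
    {"product": "EcoBin", "price": "$15"},
]

print([m["product"] for m in filter_by_budget(catalog, "under $30")])
```

Inside the tool, a filter like this would run on `results["metadatas"][0]` before the response is formatted. Requesting more than three results first leaves room for products that get filtered out.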
In [37]:
# Create E-commerce Q&A Agent
ecommerce_agent = Agent(
    name="EcommerceQA",
    role="AI E-commerce Assistant",
    model=OpenAIChat(id="gpt-4o"),
    instructions=(
        "You are a helpful e-commerce shopping assistant. "
        "Use product_search to find relevant products and product_comparison for comparisons. "
        "Answer naturally, recommend based on customer needs, and highlight key features, pricing, and ratings. "
        "Keep responses concise and helpful under 800 words."
    ),
    markdown=True,
    tools=[product_search, product_comparison],
)

# Test the agent
query = """
Recommend products for home organization under $50.
Also compare LiteProduct 1 vs MaxProduct 2.
"""

print("🤖 E-commerce Q&A Agent Ready!")
ecommerce_agent.print_response(query, stream=True)
🤖 E-commerce Q&A Agent Ready!
Output()


4. Execution & Multi-hop Reasoning

The query below requires the agent to find a product category, retrieve specific specs, and evaluate them against a battery requirement.

In [33]:
ecommerce_agent.run("Recommend tablets or smartphones for media consumption under $450 with at least 40h battery?")
Out[33]:
RunOutput(run_id='befcf277-2709-4339-ba3d-4192e3bd6030', agent_id='ecommerceqa', agent_name='EcommerceQA', session_id='eca46ba8-6e3e-4cfc-b71c-c61eeed33eca', parent_run_id=None, workflow_id=None, user_id=None, input=RunInput(input_content='Recommend tablets or smartphones for media consumption under $450 with at least 40h battery?', images=None, videos=None, audios=None, files=None), content='Here are some great options for tablets and smartphones suitable for media consumption under $450, all featuring a long battery life of at least 40 hours:\n\n1. **PremiumTablet 96**\n   - **Price**: $420\n   - **Rating**: 4.6/5 from 138 reviews\n   - **Features**: Designed for media consumption and note-taking with a 1080p resolution and USB-C connectivity.\n   - **Highlight**: Ideal for those who prioritize multimedia capabilities and a high level of performance.\n\n2. **LiteSmartphone 11**\n   - **Price**: $280\n   - **Rating**: 4.2/5 from 150 reviews\n   - **Features**: This smartphone comes with a 1080p display and is lightweight, perfect for everyday use and communication.\n   - **Highlight**: Best suited for users who want a budget-friendly option for regular communication and media consumption.\n\n3. **AdvancedTablet 7**\n   - **Price**: $320\n   - **Rating**: 4.6/5 from 175 reviews\n   - **Features**: Offers an advanced display with 8K resolution, along with USB-C connectivity.\n   - **Highlight**: Great for those looking for cutting-edge display technology in a tablet.\n\nEach of these options provides ample battery life to support prolonged media usage without frequent charging. If display quality and multimedia performance are your top priorities, "PremiumTablet 96" might be the best fit. For a more budget-conscious choice that still supports decent media features, consider "LiteSmartphone 11". 
If you want the latest display tech in a tablet, "AdvancedTablet 7" could be ideal.', content_type='str', reasoning_content=None, reasoning_steps=None, reasoning_messages=None, model_provider_data={'id': 'chatcmpl-DCJQgEDw2bldjyuct3FLM1BEIwgq6', 'system_fingerprint': 'fp_64dfa806c7'}, model='gpt-4o', model_provider='OpenAI', messages=[Message(id='9e37ea00-de24-44e9-9e37-1c291e573d9a', role='system', content='<your_role>\nAI E-commerce Assistant\n</your_role>\n\nYou are a helpful e-commerce shopping assistant. Use product_search to find relevant products and product_comparison for comparisons. Answer naturally, recommend based on customer needs, and highlight key features, pricing, and ratings. Keep responses concise and helpful under 800 words.\n\n<additional_information>\n- Use markdown to format your answers.\n</additional_information>', compressed_content=None, name=None, tool_call_id=None, tool_calls=None, audio=None, images=None, videos=None, files=None, audio_output=None, image_output=None, video_output=None, file_output=None, redacted_reasoning_content=None, provider_data=None, citations=None, reasoning_content=None, tool_name=None, tool_args=None, tool_call_error=None, stop_after_tool_call=False, add_to_agent_memory=True, from_history=False, metrics=Metrics(input_tokens=0, output_tokens=0, total_tokens=0, cost=None, audio_input_tokens=0, audio_output_tokens=0, audio_total_tokens=0, cache_read_tokens=0, cache_write_tokens=0, reasoning_tokens=0, timer=None, time_to_first_token=None, duration=None, provider_metrics=None, additional_metrics=None), references=None, created_at=1771793143, temporary=False), Message(id='8a72f644-245f-4602-8a83-1e03851492fa', role='user', content='Recommend tablets or smartphones for media consumption under $450 with at least 40h battery?', compressed_content=None, name=None, tool_call_id=None, tool_calls=None, audio=None, images=None, videos=None, files=None, audio_output=None, image_output=None, video_output=None, 
file_output=None, redacted_reasoning_content=None, provider_data=None, citations=None, reasoning_content=None, tool_name=None, tool_args=None, tool_call_error=None, stop_after_tool_call=False, add_to_agent_memory=True, from_history=False, metrics=Metrics(input_tokens=0, output_tokens=0, total_tokens=0, cost=None, audio_input_tokens=0, audio_output_tokens=0, audio_total_tokens=0, cache_read_tokens=0, cache_write_tokens=0, reasoning_tokens=0, timer=None, time_to_first_token=None, duration=None, provider_metrics=None, additional_metrics=None), references=None, created_at=1771793143, temporary=False), Message(id='6f91b138-ac5a-4015-9398-fdeafe9dd00d', role='assistant', content=None, compressed_content=None, name=None, tool_call_id=None, tool_calls=[{'id': 'call_BKwKbUusEjc4tpedcC2CHsWn', 'function': {'arguments': '{"query":"tablets or smartphones for media consumption under $450 with long battery life","category":"electronics"}', 'name': 'product_search'}, 'type': 'function'}], audio=None, images=None, videos=None, files=None, audio_output=None, image_output=None, video_output=None, file_output=None, redacted_reasoning_content=None, provider_data={'id': 'chatcmpl-DCJQcUgzZdZgHE015leYEftQgausS', 'system_fingerprint': 'fp_a9a5a50c8d'}, citations=None, reasoning_content=None, tool_name=None, tool_args=None, tool_call_error=None, stop_after_tool_call=False, add_to_agent_memory=True, from_history=False, metrics=Metrics(input_tokens=174, output_tokens=31, total_tokens=205, cost=None, audio_input_tokens=0, audio_output_tokens=0, audio_total_tokens=0, cache_read_tokens=0, cache_write_tokens=0, reasoning_tokens=0, timer=None, time_to_first_token=None, duration=None, provider_metrics=None, additional_metrics=None), references=None, created_at=1771793143, temporary=False), Message(id='6883e2f0-c4a9-4844-9ff7-03d122192c9f', role='tool', content='Found 3 matching products:\n\n**PremiumTablet 96**\n- **Category**: Electronics\n- **Price**: $420\n- **Rating**: 4.6 (138)\n- 
**Description**: Premium tablet for media consumption and note‑taking.\n- **Specs**: 1080p resolution, 40h battery, USB‑C\n\n**LiteSmartphone 11**\n- **Category**: Electronics\n- **Price**: $280\n- **Rating**: 4.2 (150)\n- **Description**: Lightweight smartphone for everyday communication.\n- **Specs**: 1080p display, 40h battery, USB‑C\n\n**AdvancedTablet 7**\n- **Category**: Electronics\n- **Price**: $320\n- **Rating**: 4.6 (175)\n- **Description**: High-end tablet with advanced display technology.\n- **Specs**: 8K resolution, 40h battery, USB‑C', compressed_content=None, name=None, tool_call_id='call_BKwKbUusEjc4tpedcC2CHsWn', tool_calls=None, audio=None, images=None, videos=None, files=None, audio_output=None, image_output=None, video_output=None, file_output=None, redacted_reasoning_content=None, provider_data=None, citations=None, reasoning_content=None, tool_name='product_search', tool_args={'query': 'tablets or smartphones for media consumption under $450 with long battery life', 'category': 'electronics'}, tool_call_error=False, stop_after_tool_call=False, add_to_agent_memory=True, from_history=False, metrics=Metrics(input_tokens=0, output_tokens=0, total_tokens=0, cost=None, audio_input_tokens=0, audio_output_tokens=0, audio_total_tokens=0, cache_read_tokens=0, cache_write_tokens=0, reasoning_tokens=0, timer=None, time_to_first_token=None, duration=None, provider_metrics=None, additional_metrics=None), references=None, created_at=1771793146, temporary=False), Message(id='006c80cf-2a49-4698-ab9f-a99ee80c7942', role='assistant', content='Here are some great options for tablets and smartphones suitable for media consumption under $450, all featuring a long battery life of at least 40 hours:\n\n1. 
**PremiumTablet 96**\n   - **Price**: $420\n   - **Rating**: 4.6/5 from 138 reviews\n   - **Features**: Designed for media consumption and note-taking with a 1080p resolution and USB-C connectivity.\n   - **Highlight**: Ideal for those who prioritize multimedia capabilities and a high level of performance.\n\n2. **LiteSmartphone 11**\n   - **Price**: $280\n   - **Rating**: 4.2/5 from 150 reviews\n   - **Features**: This smartphone comes with a 1080p display and is lightweight, perfect for everyday use and communication.\n   - **Highlight**: Best suited for users who want a budget-friendly option for regular communication and media consumption.\n\n3. **AdvancedTablet 7**\n   - **Price**: $320\n   - **Rating**: 4.6/5 from 175 reviews\n   - **Features**: Offers an advanced display with 8K resolution, along with USB-C connectivity.\n   - **Highlight**: Great for those looking for cutting-edge display technology in a tablet.\n\nEach of these options provides ample battery life to support prolonged media usage without frequent charging. If display quality and multimedia performance are your top priorities, "PremiumTablet 96" might be the best fit. For a more budget-conscious choice that still supports decent media features, consider "LiteSmartphone 11". 
If you want the latest display tech in a tablet, "AdvancedTablet 7" could be ideal.', compressed_content=None, name=None, tool_call_id=None, tool_calls=None, audio=None, images=None, videos=None, files=None, audio_output=None, image_output=None, video_output=None, file_output=None, redacted_reasoning_content=None, provider_data={'id': 'chatcmpl-DCJQgEDw2bldjyuct3FLM1BEIwgq6', 'system_fingerprint': 'fp_64dfa806c7'}, citations=None, reasoning_content=None, tool_name=None, tool_args=None, tool_call_error=None, stop_after_tool_call=False, add_to_agent_memory=True, from_history=False, metrics=Metrics(input_tokens=416, output_tokens=352, total_tokens=768, cost=None, audio_input_tokens=0, audio_output_tokens=0, audio_total_tokens=0, cache_read_tokens=0, cache_write_tokens=0, reasoning_tokens=0, timer=None, time_to_first_token=None, duration=None, provider_metrics=None, additional_metrics=None), references=None, created_at=1771793146, temporary=False)], metrics=Metrics(input_tokens=590, output_tokens=383, total_tokens=973, cost=None, audio_input_tokens=0, audio_output_tokens=0, audio_total_tokens=0, cache_read_tokens=0, cache_write_tokens=0, reasoning_tokens=0, timer=<agno.utils.timer.Timer object at 0x1b4eef590>, time_to_first_token=2.9210733649961185, duration=7.890739890994155, provider_metrics=None, additional_metrics=None), additional_input=None, tools=[ToolExecution(tool_call_id='call_BKwKbUusEjc4tpedcC2CHsWn', tool_name='product_search', tool_args={'query': 'tablets or smartphones for media consumption under $450 with long battery life', 'category': 'electronics'}, tool_call_error=False, result='Found 3 matching products:\n\n**PremiumTablet 96**\n- **Category**: Electronics\n- **Price**: $420\n- **Rating**: 4.6 (138)\n- **Description**: Premium tablet for media consumption and note‑taking.\n- **Specs**: 1080p resolution, 40h battery, USB‑C\n\n**LiteSmartphone 11**\n- **Category**: Electronics\n- **Price**: $280\n- **Rating**: 4.2 (150)\n- **Description**: 
Lightweight smartphone for everyday communication.\n- **Specs**: 1080p display, 40h battery, USB‑C\n\n**AdvancedTablet 7**\n- **Category**: Electronics\n- **Price**: $320\n- **Rating**: 4.6 (175)\n- **Description**: High-end tablet with advanced display technology.\n- **Specs**: 8K resolution, 40h battery, USB‑C', metrics=Metrics(input_tokens=0, output_tokens=0, total_tokens=0, cost=None, audio_input_tokens=0, audio_output_tokens=0, audio_total_tokens=0, cache_read_tokens=0, cache_write_tokens=0, reasoning_tokens=0, timer=None, time_to_first_token=None, duration=None, provider_metrics=None, additional_metrics=None), child_run_id=None, stop_after_tool_call=False, created_at=1771793146, requires_confirmation=None, confirmed=None, confirmation_note=None, requires_user_input=None, user_input_schema=None, user_feedback_schema=None, answered=None, external_execution_required=None, external_execution_silent=None, approval_type=None)], images=None, videos=None, audio=None, files=None, response_audio=None, citations=None, references=None, metadata=None, session_state={'current_session_id': 'eca46ba8-6e3e-4cfc-b71c-c61eeed33eca', 'current_run_id': 'befcf277-2709-4339-ba3d-4192e3bd6030'}, created_at=1771792297, events=None, status=<RunStatus.completed: 'COMPLETED'>, requirements=None, workflow_step_id=None)
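In the run above, the battery requirement is checked only by the model reading the specs text. The same check can be done deterministically, which is useful when validating the agent's answer. The sketch below assumes specs strings follow the pattern seen in the tool output (for example "40h battery"). `battery_hours` and `meets_battery_requirement` are hypothetical helper names.

```python
import re

def battery_hours(specs):
    """Return the battery life in hours parsed from a specs string, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*h\s+battery", specs, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None

def meets_battery_requirement(specs, min_hours):
    """True only when a battery figure is present and at least min_hours."""
    hours = battery_hours(specs)
    return hours is not None and hours >= min_hours

# Specs strings copied from the tool output above and from the mock catalog.
print(meets_battery_requirement("1080p resolution, 40h battery, USB-C", 40))
print(meets_battery_requirement("Plastic, 5kg limit", 40))
```

A product with no battery figure is treated as not meeting the requirement, which is the conservative choice when a customer states a hard minimum.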

Agent Evals - System and Model Evals

In [ ]:
queries = [
    "Recommend products for home organization under $50. Also compare LiteProduct 1 vs MaxProduct 2",
    "Suggest the best smart home devices under $200. Prioritize locks and thermostats with ratings above 4.5",
    "Find top audio devices with active noise cancellation under $150 and at least 100 reviews",
    "Recommend laptops or desktops for heavy multitasking with SSD > 1TB and rating above 4.5",
    "Show fitness equipment for home workouts under $400, and compare UltraBike 19 vs EssentialBike 69",
    "List eco-friendly products across any category under $60 and sort by rating",
    "Recommend tablets or smartphones for media consumption under $450 with at least 40h battery",
    "Find premium projectors for home theater under $700 and explain differences between MaxProjector 68, SmartProjector 15, and PremiumProjector 16",
    "Suggest beauty kits or tools under $50 for daily routines with ratings above 4.3",
    "Recommend wearable accessories for everyday use under $80, focusing on PremiumAccessory 32 and MaxAccessory 97",
    "Find smart locks and cameras for an apartment security setup under $300 total budget",
    "Recommend beginner-friendly books related to kits and tools under $30 and at least 40 reviews",
    "Suggest audio products suitable for long battery life (≥40h) under $200",
    "Find computing devices (laptops, desktops, workstations, servers) sorted by price-to-RAM value",
    "Recommend a starter home office setup with one computing device, one peripheral kit, and one office accessory, total under $1500"
]
In [11]:
# Two-query subset used for the traced runs below; this overrides the full list above
queries = [
    "Recommend products for home organization under $50. Also compare LiteProduct 1 vs MaxProduct 2",
    "Suggest the best smart home devices under $200. Prioritize locks and thermostats with ratings above 4.5"
]
In [13]:
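responses = []
for q in queries:
    # Keep every result; each run also emits traces through the instrumentors registered earlier
    responses.append(ecommerce_agent.run(q))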
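The full RunOutput repr printed earlier is hard to read, so it can help to reduce each run to the few fields needed for evaluation. The sketch below relies only on attribute names visible in that repr (`content`, `tools`, `tool_name`, `tool_args`, `metrics.total_tokens`). `summarize_run` is a hypothetical helper, and `fake_run` is a stand-in object so the example runs without calling the agent.

```python
from types import SimpleNamespace

def summarize_run(run):
    """Reduce a run result to the answer text, the tool calls made, and the token total."""
    tools = getattr(run, "tools", None) or []
    metrics = getattr(run, "metrics", None)
    return {
        "answer": getattr(run, "content", None),
        "tool_calls": [(t.tool_name, t.tool_args) for t in tools],
        "total_tokens": getattr(metrics, "total_tokens", None),
    }

# Stand-in that mimics the shape of the RunOutput shown earlier (values abbreviated).
fake_run = SimpleNamespace(
    content="Here are some great options...",
    tools=[SimpleNamespace(tool_name="product_search",
                           tool_args={"query": "tablets", "category": "electronics"})],
    metrics=SimpleNamespace(total_tokens=973),
)

summary = summarize_run(fake_run)
print(summary["tool_calls"])
```

Applied to each value returned by `ecommerce_agent.run`, this gives a compact record of which tools the agent chose for each query.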
In [14]:
print('#### Installing arize SDK')

! pip install "arize[Tracing]>=7.1.0, <8.0.0"

print('#### arize SDK installed!')

import os

from datetime import datetime

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

client = ArizeExportClient()

print('#### Exporting your primary dataset into a dataframe.')



primary_df = client.export_model_to_df(
    space_id=os.environ["ARIZE_SPACE_ID"],  # set in the setup cell, not hard-coded
    model_id=model_id,  # "ecom-agent-eval-v4", the project name passed to register()
    environment=Environments.TRACING,
    start_time=datetime.fromisoformat('2026-02-13T05:00:00.000+00:00'),
    end_time=datetime.fromisoformat('2026-02-21T04:59:59.999+00:00'),
    # Optionally specify columns to improve query performance
    # columns=['context.span_id', 'attributes.llm.input']
)
#### Installing arize SDK
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting arize<8.0.0,>=7.1.0 (from arize[Tracing]<8.0.0,>=7.1.0)
  Downloading arize-7.52.0-py3-none-any.whl.metadata (16 kB)
Requirement already satisfied: googleapis-common-protos<2,>=1.51.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (1.66.0)
Requirement already satisfied: pandas<3,>=0.25.3 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (2.2.3)
Requirement already satisfied: protobuf<7,>=4.21.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (5.29.6)
Requirement already satisfied: pyarrow>=0.15.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (19.0.0)
Requirement already satisfied: pydantic<3,>=2.0.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (2.10.6)
Collecting requests-futures==1.0.0 (from arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0)
  Downloading requests_futures-1.0.0-py2.py3-none-any.whl.metadata (12 kB)
Requirement already satisfied: tqdm<5,>=4.60.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (4.67.1)
Requirement already satisfied: requests>=1.2.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from requests-futures==1.0.0->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (2.32.3)
Requirement already satisfied: deprecated in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from arize[Tracing]<8.0.0,>=7.1.0) (1.2.18)
Requirement already satisfied: openinference-semantic-conventions<1,>=0.1.12 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from arize[Tracing]<8.0.0,>=7.1.0) (0.1.26)
Requirement already satisfied: opentelemetry-semantic-conventions<1,>=0.43b0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from arize[Tracing]<8.0.0,>=7.1.0) (0.60b1)
Requirement already satisfied: opentelemetry-api==1.39.1 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from opentelemetry-semantic-conventions<1,>=0.43b0->arize[Tracing]<8.0.0,>=7.1.0) (1.39.1)
Requirement already satisfied: typing-extensions>=4.5.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from opentelemetry-semantic-conventions<1,>=0.43b0->arize[Tracing]<8.0.0,>=7.1.0) (4.15.0)
Requirement already satisfied: importlib-metadata<8.8.0,>=6.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from opentelemetry-api==1.39.1->opentelemetry-semantic-conventions<1,>=0.43b0->arize[Tracing]<8.0.0,>=7.1.0) (8.5.0)
Requirement already satisfied: numpy>=1.26.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from pandas<3,>=0.25.3->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from pandas<3,>=0.25.3->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from pandas<3,>=0.25.3->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from pandas<3,>=0.25.3->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (2025.1)
Requirement already satisfied: annotated-types>=0.6.0 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from pydantic<3,>=2.0.0->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (0.7.0)
Requirement already satisfied: pydantic-core==2.27.2 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from pydantic<3,>=2.0.0->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (2.27.2)
Requirement already satisfied: wrapt<2,>=1.10 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from deprecated->arize[Tracing]<8.0.0,>=7.1.0) (1.17.2)
Requirement already satisfied: six>=1.5 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas<3,>=0.25.3->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (1.17.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from requests>=1.2.0->requests-futures==1.0.0->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from requests>=1.2.0->requests-futures==1.0.0->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from requests>=1.2.0->requests-futures==1.0.0->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from requests>=1.2.0->requests-futures==1.0.0->arize<8.0.0,>=7.1.0->arize[Tracing]<8.0.0,>=7.1.0) (2025.1.31)
Requirement already satisfied: zipp>=3.20 in /Users/madmax_jos/Documents/agentaicourse/venv/lib/python3.12/site-packages (from importlib-metadata<8.8.0,>=6.0->opentelemetry-api==1.39.1->opentelemetry-semantic-conventions<1,>=0.43b0->arize[Tracing]<8.0.0,>=7.1.0) (3.21.0)
Downloading arize-7.52.0-py3-none-any.whl (238 kB)
Downloading requests_futures-1.0.0-py2.py3-none-any.whl (7.4 kB)
Installing collected packages: requests-futures, arize
  Attempting uninstall: requests-futures
    Found existing installation: requests-futures 1.0.2
    Uninstalling requests-futures-1.0.2:
      Successfully uninstalled requests-futures-1.0.2
  Attempting uninstall: arize
    Found existing installation: arize 8.2.1
    Uninstalling arize-8.2.1:
      Successfully uninstalled arize-8.2.1
Successfully installed arize-7.52.0 requests-futures-1.0.0

[notice] A new release of pip is available: 25.0 -> 26.0.1
[notice] To update, run: pip install --upgrade pip
#### arize SDK installed!
  arize.utils.logging | INFO | Creating named session as 'python-sdk-arize_python_export_client-c7253f78-f4c3-4d03-91cb-88909eb05630'.
#### Exporting your primary dataset into a dataframe.
  arize.utils.logging | INFO | Fetching data...
  arize.utils.logging | INFO | Starting exporting...
  exporting 46 rows: 100%|█████████████████████████| 46/46 [00:00, 231.21 row/s]
In [16]:
primary_df.columns
Out[16]:
Index(['attributes.llm.token_count.completion_details.output', 'event.names',
       'attributes.exception.type', 'attributes.llm.prompt_template.version',
       'attributes.retrieval.documents', 'parent_id',
       'attributes.llm.cost.completion_details.audio', 'time',
       'attributes.agno.agent', 'attributes.llm.token_count.prompt',
       'attributes.llm.cost.prompt', 'attributes.embedding.model_name',
       'attributes.llm.prompt_template.template', 'name',
       'attributes.openinference.span.kind', 'attributes.exception.stacktrace',
       'attributes.input.value', 'attributes.reranker.query',
       'attributes.llm.token_count.completion_details.audio',
       'attributes.llm.provider',
       'attributes.llm.cost.prompt_details.cache_read',
       'attributes.llm.cost.completion_details.reasoning',
       'attributes.llm.cost.completion_details.output',
       'attributes.llm.token_count.total', 'attributes.output.mime_type',
       'attributes.reranker.model_name',
       'attributes.llm.token_count.prompt_details.audio',
       'attributes.llm.invocation_parameters', 'attributes.llm.system',
       'attributes.agno.run.id', 'attributes.llm.token_count.completion',
       'events', 'status_code', 'latency_ms', 'end_time',
       'attributes.input.mime_type',
       'attributes.llm.token_count.prompt_details.cache_read',
       'context.trace_id', 'attributes.llm.prompt_template.variables',
       'attributes.graph.node.name', 'context.span_id', 'attributes.tool.name',
       'attributes.llm.cost.completion',
       'attributes.llm.cost.prompt_details.input', 'attributes.agno.agent.id',
       'attributes.output.value', 'attributes.llm.cost.prompt_details.audio',
       'attributes.exception.message', 'attributes.llm.cost.total',
       'start_time', 'attributes.agno.tools', 'attributes.session.id',
       'attributes.graph.node.id', 'event.timestamps',
       'attributes.llm.input_messages',
       'attributes.llm.token_count.prompt_details.input', 'event.attributes',
       'attributes.tool.description', 'attributes.llm.model_name',
       'attributes.agent.name', 'attributes.llm.tools', 'status_message',
       'attributes.tool.parameters',
       'attributes.llm.token_count.completion_details.reasoning',
       'attributes.llm.output_messages'],
      dtype='object')
In [17]:
RAG_RELEVANCY_PROMPT_TEMPLATE = """
You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {{input}}
    ************
    [Reference text]: {{documents}}
    ************
    [END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can help answer the Question. First, write out in a step by step manner
an EXPLANATION to show how to arrive at the correct answer. Avoid simply stating the correct answer
at the outset. Your response LABEL must be a single word, either "relevant" or "unrelated", and
should not contain any text or characters aside from that word. "unrelated" means that the
reference text does not help answer the Question. "relevant" means the reference text directly
answers the Question.

Example response:
LABEL: "relevant" or "unrelated"
************
"""
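To make the role of the `{{input}}` and `{{documents}}` placeholders concrete, here is a minimal sketch of how a mustache-style template can be inspected and filled for a single row. It is an illustration only, not Phoenix's own rendering code: `TEMPLATE` is a shortened stand-in for the prompt above, and `placeholders` and `render` are hypothetical helper names.

```python
import re

# Shortened stand-in for the notebook's template; only the placeholders matter here.
TEMPLATE = """[Question]: {{input}}
[Reference text]: {{documents}}"""

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def placeholders(template: str) -> list[str]:
    """Return the distinct {{name}} placeholders in order of first appearance."""
    seen: list[str] = []
    for name in PLACEHOLDER.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


def render(template: str, values: dict[str, str]) -> str:
    """Substitute each {{name}} with its value; raise if a value is missing."""
    missing = [name for name in placeholders(template) if name not in values]
    if missing:
        raise KeyError(f"missing template variables: {missing}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


# The question is the one logged in this notebook; the document text is a placeholder.
row = {"input": "home organization under $50", "documents": "(retrieved text goes here)"}
print(placeholders(TEMPLATE))
print(render(TEMPLATE, row))
```

Checking the placeholder names up front is a cheap way to catch a mismatch between the template and the dataframe columns before any judge calls are made.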
In [18]:
spans_df = primary_df[
    [
        "name",
        "context.span_id",
        "attributes.openinference.span.kind",
        "context.trace_id",
        "attributes.input.value",
        "attributes.retrieval.documents",
    ]
]
In [23]:
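# Keep only retriever spans that actually recorded retrieved documents.
filtered_df = spans_df[
    (spans_df["attributes.openinference.span.kind"] == "retriever")
    & (spans_df["attributes.retrieval.documents"].notnull())
]

# Rename to match the {{input}} and {{documents}} placeholders in the prompt template.
filtered_df = filtered_df.rename(
    columns={"attributes.input.value": "input", "attributes.retrieval.documents": "documents"}
)

filtered_df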
Out[23]:
name context.span_id attributes.openinference.span.kind context.trace_id input documents
3 RAG 0390d509c7879064 retriever 15e205614934c48ad1f2d44a65aaa57f home organization under $50 [{'document.content': 'MaxProduct 2 in Home Or...
9 RAG b101ec77316ced40 retriever df70b5d3c8f7f28863d774f450821922 home organization under $50 [{'document.content': 'MaxProduct 2 in Home Or...
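Each `documents` cell holds a list of dictionaries keyed by `document.content`, so the judge receives that raw structure. If plain text is preferred, the cell could be flattened first. The sketch below is one possible way to do that; `documents_to_text` is a hypothetical helper, and the sample documents are made-up placeholders rather than catalog data.

```python
def documents_to_text(documents, separator="\n---\n"):
    """Join the 'document.content' field of each retrieved document into one string."""
    if documents is None:
        return ""
    parts = []
    for doc in documents:
        # Skip entries that are not dicts or that carry no content.
        content = doc.get("document.content") if isinstance(doc, dict) else None
        if content:
            parts.append(str(content).strip())
    return separator.join(parts)


# Made-up sample shaped like the 'documents' column above.
sample_documents = [
    {"document.content": "Placeholder product A description"},
    {"document.content": "  Placeholder product B description  "},
    {"document.id": "no-content"},
]
print(documents_to_text(sample_documents))
```

Under these assumptions it could be applied with `filtered_df["documents"].apply(documents_to_text)` before running the evaluator.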
In [24]:
from openinference.instrumentation import suppress_tracing
from phoenix.evals.evaluators import async_evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals import create_classifier

# Judge model used to grade each retrieval.
llm = LLM(provider="openai", model="gpt-5")

# The choices mapping converts the judge's label into a numeric score.
relevancy_evaluator = create_classifier(
    name="RAG Relevancy",
    llm=llm,
    prompt_template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    choices={"relevant": 1.0, "unrelated": 0.0},
)

# Suppress tracing so the judge's own LLM calls are not logged as spans in the project.
with suppress_tracing():
    results_df = await async_evaluate_dataframe(
        dataframe=filtered_df,
        evaluators=[relevancy_evaluator],
    )
results_df.head()
Evaluating Dataframe |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s
Out[24]:
name context.span_id attributes.openinference.span.kind context.trace_id input documents RAG Relevancy_execution_details RAG Relevancy_score
3 RAG 0390d509c7879064 retriever 15e205614934c48ad1f2d44a65aaa57f home organization under $50 [{'document.content': 'MaxProduct 2 in Home Or... {'status': 'COMPLETED', 'exceptions': [], 'exe... {'name': 'RAG Relevancy', 'score': 0.0, 'label...
9 RAG b101ec77316ced40 retriever df70b5d3c8f7f28863d774f450821922 home organization under $50 [{'document.content': 'MaxProduct 2 in Home Or... {'status': 'COMPLETED', 'exceptions': [], 'exe... {'name': 'RAG Relevancy', 'score': 0.0, 'label...
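The `RAG Relevancy_score` column holds one dictionary per row, which is awkward to scan. A small sketch of how those cells could be flattened and summarized follows. It assumes each cell is a dict with `label` and `score` keys, as the truncated output above suggests; `summarize_scores` is a hypothetical helper, and the sample cells are hand-written for illustration, not the notebook's actual results.

```python
def summarize_scores(score_cells):
    """Flatten score dicts into (label, score) pairs and compute the mean score."""
    rows = []
    for cell in score_cells:
        # Rows whose evaluation failed may not contain a dict; skip them.
        if not isinstance(cell, dict):
            continue
        rows.append((cell.get("label"), cell.get("score")))
    numeric = [score for _, score in rows if isinstance(score, (int, float))]
    mean = sum(numeric) / len(numeric) if numeric else None
    return rows, mean


# Hand-written sample cells shaped like the column above.
sample_cells = [
    {"name": "RAG Relevancy", "score": 0.0, "label": "unrelated"},
    {"name": "RAG Relevancy", "score": 1.0, "label": "relevant"},
    None,
]
rows, mean_score = summarize_scores(sample_cells)
print(rows)
print(mean_score)
```

With the `choices` mapping used in this notebook, the mean score is the fraction of retrievals judged relevant.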
In [25]:
from arize.pandas.logger import Client
from phoenix.evals.utils import to_annotation_dataframe

# Client() is called without arguments, so the Arize credentials must come from the environment.
client = Client()

# Pass the dataframe by keyword; positional arguments are deprecated.
rag_eval_df = to_annotation_dataframe(dataframe=results_df)

# Arize expects evaluation columns named eval.<name>.label, .score, and .explanation.
rag_eval_df = rag_eval_df.rename(columns={
    "label": "eval.rag.label",
    "score": "eval.rag.score",
    "explanation": "eval.rag.explanation",
    "metadata": "eval.rag.metadata"
})

client.log_evaluations_sync(rag_eval_df, 'ecom-agent-eval-v4')
  arize.utils.logging | INFO | The following columns do not follow the evaluation column naming convention and will be ignored: eval.rag.metadata, annotation_name and annotator_kind. Evaluation columns must be named as follows: - eval.<your-eval-name>.label- eval.<your-eval-name>.score- eval.<your-eval-name>.explanation
  arize.utils.logging | INFO | ✅ All 2 evaluation data have been logged successfully for model 'ecom-agent-eval-v4'!
Out[25]:
records_updated: 2
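The log message above states the naming convention Arize applies to evaluation columns. The following sketch shows how a column list could be checked against that convention before logging. It is an illustration, not Arize's validation code: `split_eval_columns` is a hypothetical helper, and two details are assumptions, namely that the eval name contains no dots and that `context.span_id` is the identifier column to keep.

```python
import re

# eval.<name>.label / .score / .explanation, as stated in the log message.
# Assumption: the eval name itself contains no dots.
EVAL_COLUMN = re.compile(r"^eval\.([^.]+)\.(label|score|explanation)$")


def split_eval_columns(columns, keep=("context.span_id",)):
    """Split columns into those matching the convention (plus `keep`) and the rest."""
    accepted, ignored = [], []
    for column in columns:
        if column in keep or EVAL_COLUMN.match(column):
            accepted.append(column)
        else:
            ignored.append(column)
    return accepted, ignored


columns = [
    "context.span_id",
    "eval.rag.label",
    "eval.rag.score",
    "eval.rag.explanation",
    "eval.rag.metadata",
    "annotation_name",
    "annotator_kind",
]
accepted, ignored = split_eval_columns(columns)
print(accepted)
print(ignored)
```

Here the three ignored names are the same ones reported in the log, which suggests the `metadata` rename has no effect on what gets logged.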

Click on the retriever spans within each trace to view detailed evaluation results. You can also filter by evaluation outcome to quickly identify which queries successfully retrieved the most relevant documents.


This section walks through setting up and running evaluations in the Arize UI. Specifically, we run a trace-level evaluation to assess the answer quality of our agent.¶


  1. In the project containing your traces, go to Eval Tasks and select LLM as a Judge.

  2. Name your task and schedule it to run on historical data. Each task can include multiple evaluators, but this walkthrough focuses on setting up one.

  3. Choose a trace-level evaluation.

  4. From the predefined templates, select Q&A or another template of your choice. You can also create a custom evaluation. If you define your own, ensure the variables align with your trace structure and specify the output labels (rails).

  5. Click Create Evals. Your evaluations will begin running and will appear on your existing traces. Look for the eval result on the top span for each trace.

Trace-Level Evaluation in the Arize UI¶

[Screenshots: image.png, evals.png]

4. Multi-Layered Evaluation¶

Updated evaluators for tool picking, response correctness, and tone.


In [ ]: